Bayesian Linear Regression

Analysis of Flight Delay Data

Author

Sara Parrish (Advisor: Dr. Seals)

Published

Oct 21, 2024

Slides: slides.html

Introduction

The introduction should:

  • Develop a storyline that captures attention and maintains interest.

  • Your audience is your peers

  • Clearly state the problem or question you’re addressing.

  • Explain why it is relevant.

  • Provide an overview of your approach.

Methods

The Frequentist Framework

Linear regression can be carried out under several frameworks; the two of interest here are the frequentist and the Bayesian. The frequentist approach is the more familiar one. It estimates the effects of independent variables (predictors) on a dependent variable (the outcome). Each regression coefficient is treated as a fixed but unknown value and is estimated by a single point estimate. The frequentist linear model is

\[ Y = \beta_0 + \beta_1X + \varepsilon \tag{1} \]

  • \(Y\) : Dependent variable, the outcome
  • \(\beta_0\) : The y-intercept
  • \(\beta_1\) : The regression coefficient (slope)
  • \(X\) : Independent variable, the predictor
  • \(\varepsilon\) : Random error (Yan and Su 2009)
  • \(\hat\beta\) : The point estimate of a regression coefficient
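To make the idea of a point estimate concrete, the following is a minimal sketch that fits Equation (1) by ordinary least squares. The data are simulated, and the true values (intercept 3, slope 2) are made up for illustration; they are not taken from the flight data.

```r
# Simulate a small data set from Y = beta_0 + beta_1 * X + error.
# All values below are hypothetical and chosen only for illustration.
set.seed(1)
n <- 200
x <- runif(n, min = 0, max = 10)
y <- 3 + 2 * x + rnorm(n, mean = 0, sd = 1)

# Ordinary least squares fit: each coefficient is returned as a single number.
ols_fit <- lm(y ~ x)
beta_hat <- coef(ols_fit)
print(beta_hat)

# Uncertainty is reported separately, as intervals around the point estimates.
print(confint(ols_fit, level = 0.95))
```

The output is one number per coefficient, in contrast to the distribution over coefficients that the Bayesian framework produces.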

The Bayesian Framework

The Bayesian approach estimates the relationship between predictors and an outcome in a similar way; however, its regression coefficient is not a point estimate but a distribution. That is, the regression coefficient is not assumed to be a fixed value. The Bayesian approach also goes a step further than frequentist regression in its inclusion of prior information. It is so named because it is based on Bayes’ rule, which is written as follows:

\[ Posterior = \frac{Likelihood \times Prior}{Normalization} \]

  • The \(Prior\) is a model of prior knowledge about the parameters
  • The \(Likelihood\) is the probability of the observed data given the model parameters
  • The \(Normalization\) is a constant that ensures the posterior distribution is a valid density function that integrates to 1
  • The \(Posterior\) is the probability model that expresses an updated view of the model parameters, relative to the initial beliefs encoded in the prior

In terms of calculating probability, Bayes’ rule can be written as

\[ p(B|A) = \frac{p(A|B)\cdot p(B)}{p(A)} \tag{2} \]

  • Bayes’ rule allows for the calculation of inverse probability (\(p(B|A) \text{ from } p(A|B)\))
    • \(p(B|A) \text{ and } p(A|B)\) are conditional probabilities
    • \(p(A) \text{ and } p(B)\) are marginal probabilities (Lesaffre and Lawson 2012)
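A small numeric sketch of Equation (2) may help. Here \(A\) is taken to be "the flight arrives late" and \(B\) "the flight departs in the evening". All probabilities below are hypothetical values chosen for illustration, not estimates from the dataset.

```r
# Hypothetical inputs (assumptions for illustration only).
p_B <- 0.11              # p(B): flight departs in the evening
p_A_given_B <- 0.30      # p(A|B): late arrival given an evening departure
p_A_given_notB <- 0.15   # p(A|not B): late arrival given any other departure

# Marginal probability of a late arrival (law of total probability).
p_A <- p_A_given_B * p_B + p_A_given_notB * (1 - p_B)

# Bayes' rule: the inverse probability p(B|A).
p_B_given_A <- p_A_given_B * p_B / p_A
print(c(p_A = p_A, p_B_given_A = p_B_given_A))
```

The marginal \(p(A)\) in the denominator is what rescales the product \(p(A|B)\cdot p(B)\) into a valid probability.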

For continuous parameters, Bayes’ rule can be written as

\[ \begin{align*} p(\theta|y) =& \frac{ L(\theta|y)p(\theta) }{p(y)}\\ \\ p(\theta|y) \propto & \text{ }L(\theta|y)p(\theta) \end{align*} \]

The normalization constant, \(p(y)\) above, ensures that the posterior is a valid distribution, but because it does not depend on \(\theta\), the posterior density can be written up to proportionality without it. The result of the analysis is not a point estimate but a distribution (Bayes 1763). In Bayes’ theorem, the posterior distribution \(p(\theta|y)\), the updated belief about the parameter given the data, is proportional to the product of the likelihood of \(\theta\) given \(y\), \(L(\theta|y)\), and the prior density of \(\theta\), \(p(\theta)\). The likelihood function carries the information in the new data, while the prior allows for the incorporation of existing knowledge about \(\theta\) (Yan and Su 2009). With the data written as \(D\) and densities written as \(f\), the same statement reads:

\[ f(\theta|D) \propto L(\theta|D)f(\theta) \tag{3} \]
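The proportionality above can be illustrated with a grid approximation. This is a sketch under simplifying assumptions that are not part of the analysis: a single parameter \(\theta\) (the mean of normally distributed data with known standard deviation 1), a \(N(0, 2^2)\) prior, and simulated data.

```r
set.seed(42)
y <- rnorm(20, mean = 1.5, sd = 1)   # simulated data; the sd is treated as known

# Grid of candidate values for theta.
theta_grid <- seq(-5, 5, length.out = 2001)
step <- theta_grid[2] - theta_grid[1]

# Prior density and log-likelihood of the whole sample at each grid value.
prior <- dnorm(theta_grid, mean = 0, sd = 2)
log_lik <- sapply(theta_grid, function(th) {
  sum(dnorm(y, mean = th, sd = 1, log = TRUE))
})

# Likelihood times prior, rescaled by a constant for numerical stability.
unnormalized <- exp(log_lik - max(log_lik)) * prior

# Dividing by the numerically integrated total plays the role of p(y).
posterior <- unnormalized / sum(unnormalized * step)

posterior_mean <- sum(theta_grid * posterior * step)
print(c(sample_mean = mean(y), posterior_mean = posterior_mean))
```

The posterior mean falls between the prior mean (0) and the sample mean, which shows how the prior and the likelihood are combined.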

The Model

To generate a model for our analysis, we start with the normal data model \(Y_i|\beta_0, \beta_1, \sigma \sim N(\mu, \sigma^2)\) and replace the common mean \(\mu\) with a mean \(\mu_i\) that is specific to the value of our predictor, departure time. The model is:

\[ \begin{align*} Y_i|\beta_0, \beta_1, \sigma &\overset{\text{ind}}{\sim} N (\mu_i, \sigma^2) && \text{with } && \mu_i = \beta_0 + \beta_1X_i \end{align*} \]

Where:

  • \(Y_i\) is the arrival delay for the i-th flight
  • \(X_i\) is the departure time for the i-th flight
  • \(\mu_i = \beta_0 + \beta_1X_i\) is the local mean arrival delay, specific to the departure time
  • \(\sigma^2\) is the variance of the errors
  • \(\overset{\text{ind}}{\sim}\) indicates conditional independence of each arrival delay given the parameters

Prior Selection

This analysis will explore the differences in Bayesian linear regression using flat priors and tuned priors.

Since we are using only two data variables, arrival delay and departure time, the regression parameters are \(\beta_0\), \(\beta_1\), and \(\sigma\) for the intercept, slope, and error standard deviation. As the intercept and slope can take any real value, we will use normal prior models (Johnson, Ott, and Dogucu 2022).

\[ \begin{align*} \beta_0 &\sim N(m_0, s^2_0)\\ \beta_1 &\sim N(m_1, s^2_1) \end{align*} \]

where \(m_0, s_0, m_1, \text{and } s_1\) are hyperparameters.

The standard deviation parameter must be positive, so we will use an exponential model (Johnson, Ott, and Dogucu 2022).

\[ \sigma \sim \text{Exp}(l) \]

Because the exponential model is a special case of the gamma model with shape \(s = 1\), we can use the definitions of the mean and variance of the gamma model to find those of the exponential model (Johnson, Ott, and Dogucu 2022).

\[ \begin{align*} E(\sigma) = \frac{1}{l} && \text{and} && SD(\sigma) = \frac{1}{l} \end{align*} \]
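As a short check of these expressions, assume the gamma model is parameterized by a shape \(s\) and a rate \(r\) (the symbol \(r\) is introduced here only for illustration), with mean \(s/r\) and variance \(s/r^2\). Setting \(s = 1\) and \(r = l\) gives:

\[ \begin{align*} E(\sigma) &= \frac{s}{r} = \frac{1}{l}\\ Var(\sigma) &= \frac{s}{r^2} = \frac{1}{l^2}\\ SD(\sigma) &= \sqrt{Var(\sigma)} = \frac{1}{l} \end{align*} \]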

The Bayesian Linear Regression Model

The model can be written as

\[ \begin{align*} Y_i|\beta_0, \beta_1, \sigma &\overset{\text{ind}}{\sim} N (\mu_i, \sigma^2) && \text{with } && \mu_i = \beta_0 + \beta_1X_i \\ \beta_{0} &\sim N(m_0, s_0^2)\\ \beta_1 &\sim N(m_1, s_1^2)\\ \sigma &\sim \text{Exp}(l) \end{align*} \]

Tuning Hyperparameters

\(\beta_0\) informs the model intercept
Code
summary(Delays_sample$DEP_TIME_MINS) #mean departure time is 809.3 minutes (~ 1:30pm)

Delays_sample_filtered_B0 <- subset(Delays_sample, DEP_TIME_MINS >= 800 & DEP_TIME_MINS <= 820)

mean(Delays_sample_filtered_B0$ARR_DELAY) #m_0c = 2
sd(Delays_sample_filtered_B0$ARR_DELAY)  #s_0c = 36

\(\beta_{0c}\) reflects the typical arrival delay at a typical departure time, where the subscript \(c\) denotes the centered intercept. At the mean departure time of \(\sim\) 1:30pm, the average arrival delay is \(\sim\) 2 minutes with a standard deviation of \(\sim\) 36 minutes.

\[ \beta_{0c} \sim N(2, 36^2) \]

\(\beta_1\) informs the model slope
Code
lm_model <- lm(ARR_DELAY ~ DEP_TIME_MINS, data = Delays_sample)

summary(lm_model)

coef(lm_model)["DEP_TIME_MINS"] #m_1 = 0.01903
summary(lm_model)$coefficients["DEP_TIME_MINS", "Std. Error"] #s_1 = 0.0005

The slope of the linear model indicates a 0.019 minute increase in arrival delay per one-minute increase in departure time, so we set \(m_1 = 0.02\). The standard error of 0.0005 reflects high confidence, but so as not to over-constrain the model we set a larger prior standard deviation, \(s_1 = 0.01\).

\[ \beta_{1} \sim N(0.02, 0.01^2) \]

\(\sigma\) informs the model standard deviation
Code
summary(lm_model)$sigma

To tune the exponential model, we set the expected value of the standard deviation, \(E(\sigma)\), equal to the residual standard error of the linear model, \(\sim 50\). With this, we can find the rate parameter, \(l\).

\[ \begin{align*} E(\sigma) &= \frac{1}{l} = 50\\\\ l &= \frac{1}{50} = 0.02\\\\ \sigma &\sim \text{Exp}(0.02) \end{align*} \]
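The tuned rate can be sanity-checked by simulation. The sketch below draws from an exponential model with rate 0.02 and compares the sample mean and standard deviation with the target of about 50; the number of draws and the seed are arbitrary choices.

```r
rate_l <- 1 / 50   # l = 0.02, from E(sigma) = 1 / l = 50

set.seed(123)
sigma_draws <- rexp(100000, rate = rate_l)

# Both the mean and the standard deviation should be close to 1 / l.
print(c(mean = mean(sigma_draws), sd = sd(sigma_draws)))

# Central 95% range of sigma implied by this prior.
print(qexp(c(0.025, 0.975), rate = rate_l))
```

The quantiles give a sense of how wide a range of error standard deviations this prior considers plausible.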

The Updated Model

\[ \begin{align*} Y_i|\beta_0, \beta_1, \sigma &\overset{\text{ind}}{\sim} N (\mu_i, \sigma^2) && \text{with } && \mu_i = \beta_0 + \beta_1X_i \\ \beta_{0} &\sim N(2, 36^2)\\ \beta_1 &\sim N(0.02, 0.01^2)\\ \sigma &\sim \text{Exp}(0.02) \end{align*} \]
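One way to inspect what these tuned priors imply is a prior predictive simulation. The sketch below is an illustration under two assumptions: the intercept prior is read as describing the mean arrival delay at the typical departure time of about 809 minutes (as in the tuning step), and the 6:00pm departure used for prediction is a hypothetical example.

```r
set.seed(2024)
n_draws <- 1000
x_typical <- 809   # approximate mean departure time in minutes after midnight

# Draw parameter values from the tuned priors.
beta_0c <- rnorm(n_draws, mean = 2, sd = 36)
beta_1 <- rnorm(n_draws, mean = 0.02, sd = 0.01)
sigma <- rexp(n_draws, rate = 0.02)

# Prior-implied arrival delays for a hypothetical 6:00pm departure (1080 minutes).
x_new <- 1080
mu_new <- beta_0c + beta_1 * (x_new - x_typical)
y_new <- rnorm(n_draws, mean = mu_new, sd = sigma)

print(quantile(y_new, probs = c(0.025, 0.5, 0.975)))
```

If the simulated delays cover a plausible range without being dominated by impossible values, the priors are at least not in conflict with basic knowledge about flight delays.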

Statistical Programming

Data were collected by the Bureau of Transportation Statistics (BTS) and accessed through a dataset compiled by Patrick Zelazko (Zelazko 2023). The data were imported into R (R Core Team 2023) from a CSV file. This is a large time-series dataset with 3 million observations, each a specific flight, and 32 features. It covers flights within the United States from 2019 through 2023. Diverted and cancelled flights are recorded, as are the delay times in minutes and the reasons attributed to each delay.

The stan_glm() function from the “rstanarm” library (Brilleman et al. 2018) was used to simulate the Normal Bayesian linear regression model. This function runs the Markov chain Monte Carlo simulation with a specified number of chains and iterations and an optional seed; these were set to 4 chains, 2000 iterations, and a seed of 123. Posterior predictions were simulated with the posterior_predict() function, also from “rstanarm” (Brilleman et al. 2018).

The model was evaluated by considering the data and its source, the assumptions of the model, and the accuracy of its predictions. The posterior predictions were evaluated with the prediction_summary() function from the “bayesrules” library (Dogucu, Johnson, and Ott 2021), which provides the median absolute error (MAE), the scaled MAE, and the proportion of observed values that fall within the 50% and 95% posterior prediction intervals.
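The workflow described above might look roughly like the following. This is a sketch rather than the exact code used in the analysis: the wrapper name fit_delay_model is hypothetical, the column names ARR_DELAY and DEP_TIME_MINS are taken from the tuning code, and the mapping of the tuned priors onto the prior_intercept, prior, and prior_aux arguments is an assumption.

```r
# Sketch only: the fit is wrapped in a function so nothing is sampled until it is called.
# Assumes `data` has numeric columns ARR_DELAY and DEP_TIME_MINS.
fit_delay_model <- function(data, chains = 4, iter = 2000, seed = 123) {
  rstanarm::stan_glm(
    ARR_DELAY ~ DEP_TIME_MINS,
    data = data,
    family = gaussian(),
    prior_intercept = rstanarm::normal(2, 36),   # centered intercept
    prior = rstanarm::normal(0.02, 0.01),        # slope
    prior_aux = rstanarm::exponential(0.02),     # error standard deviation
    chains = chains, iter = iter, seed = seed
  )
}

# Usage (requires the rstanarm and bayesrules packages; not run here):
# model <- fit_delay_model(Delays_sample)
# predictions <- rstanarm::posterior_predict(model, newdata = Delays_sample)
# bayesrules::prediction_summary(model, data = Delays_sample)
```

Keeping the chains, iterations, and seed as arguments makes it easy to rerun the same model with flat priors for comparison.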

Possible extensions include k-fold cross-validation and model averaging.
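To show the mechanics of k-fold cross-validation, the sketch below uses simulated data and an ordinary lm() fit as a stand-in for the Bayesian model. The column names mirror the analysis, but the values, the number of folds, and the choice of median absolute error as the score are assumptions made for illustration.

```r
set.seed(123)
n <- 500

# Simulated stand-in data; values are made up, not drawn from the flight data.
sim <- data.frame(DEP_TIME_MINS = runif(n, min = 300, max = 1380))
sim$ARR_DELAY <- 2 + 0.02 * (sim$DEP_TIME_MINS - 809) + rnorm(n, sd = 50)

# Assign each row to one of k folds of near-equal size, in random order.
k <- 10
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other folds, then score predictions on the held-out fold.
fold_mae <- sapply(seq_len(k), function(fold) {
  train <- sim[fold_id != fold, ]
  test <- sim[fold_id == fold, ]
  fit <- lm(ARR_DELAY ~ DEP_TIME_MINS, data = train)   # stand-in for the Bayesian fit
  median(abs(test$ARR_DELAY - predict(fit, newdata = test)))
})

cv_mae <- mean(fold_mae)
print(cv_mae)
```

Because every observation is held out exactly once, the averaged error is a less optimistic measure of predictive accuracy than an error computed on the training data.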

Analysis and Results

Exploratory Data Analysis

Following are the definitions of the given variables in this dataset.

| Header | Description |
| --- | --- |
| Fl Date | Flight Date (yyyy-mm-dd) |
| Airline | Airline Name |
| Airline DOT | Airline Name and Unique Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2). Use this field for analysis across a range of years. |
| Airline Code | Unique Carrier Code |
| DOT Code | An identification number assigned by US DOT to identify a unique airline (carrier). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation. |
| Fl Number | Flight Number |
| Origin | Origin Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused. |
| Origin City | Origin City Name, State Code |
| Dest | Destination Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused. |
| Dest City | Destination City Name, State Code |
| CRS Dep Time | CRS Departure Time (local time: hhmm) |
| Dep Time | Actual Departure Time (local time: hhmm) |
| Dep Delay | Difference in minutes between scheduled and actual departure time. Early departures show negative numbers. |
| Taxi Out | Taxi Out Time, in Minutes |
| Wheels Off | Wheels Off Time (local time: hhmm) |
| Wheels On | Wheels On Time (local time: hhmm) |
| Taxi In | Taxi In Time, in Minutes |
| CRS Arr Time | CRS Arrival Time (local time: hhmm) |
| Arr Time | Actual Arrival Time (local time: hhmm) |
| Arr Delay | Difference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers. |
| Cancelled | Cancelled Flight Indicator (1=Yes) |
| Cancellation Code | Specifies The Reason For Cancellation |
| Diverted | Diverted Flight Indicator (1=Yes) |
| CRS Elapsed Time | CRS Elapsed Time of Flight, in Minutes |
| Actual Elapsed Time | Elapsed Time of Flight, in Minutes |
| Air Time | Flight Time, in Minutes |
| Distance | Distance between airports (miles) |
| Carrier Delay | Carrier Delay, in Minutes |
| Weather Delay | Weather Delay, in Minutes |
| NAS Delay | National Air System Delay, in Minutes |
| Security Delay | Security Delay, in Minutes |
| Late Aircraft Delay | Late Aircraft Delay, in Minutes |

Table 1 summarizes the dataset by flight period.

Code
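library(dplyr)   # summarise(), mutate(), select(), bind_rows(), n_distinct()
library(tidyr)   # pivot_longer(), pivot_wider()
library(gt)      # table formatting

# Totals row: the same statistics computed over all flights in Delays1
Table1.2_total <- Delays1 %>%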
  summarise(
    .by = NULL,
    flight_period = "Total",
    TotalFlights = n(),
    TotalUniqueDates = n_distinct(fl_date),
    TotalUniqueOrigins = n_distinct(origin),
    TotalUniqueDestinations = n_distinct(dest),
    AvgCRSDepTime = mean(crs_dep_time, na.rm = TRUE),
    AvgDepTime = mean(dep_time, na.rm = TRUE),
    AvgDepDelay = round(mean(dep_delay, na.rm = TRUE), 2),
    AvgTaxiOut = round(mean(taxi_out, na.rm = TRUE), 2),
    AvgTaxiIn = round(mean(taxi_in, na.rm = TRUE), 2),
    AvgCRSArrTime = mean(crs_arr_time, na.rm = TRUE),
    AvgArrTime = mean(arr_time, na.rm = TRUE),
    AvgArrDelay = round(mean(arr_delay, na.rm = TRUE), 2),
    AvgAirTime = round(mean(air_time, na.rm = TRUE), 2),
    CancelledFlights = sum(cancelled, na.rm = TRUE),
    DivertedFlights = sum(diverted, na.rm = TRUE), 
    AvgCarrierDelay = round(mean(carrier_delay, na.rm = TRUE), 2),
    AvgSecurityDelay = round(mean(security_delay, na.rm = TRUE), 2),
    AvgWeatherDelay = round(mean(weather_delay, na.rm = TRUE), 2),
    AvgNASDelay = round(mean(nas_delay, na.rm = TRUE), 2),
    AvgLateAircraftDelay = round(mean(lateaircraft_delay, na.rm = TRUE), 2),
    CarrierDelay_ct = sum(carrier_delay > 0, na.rm = TRUE),
    SecurityDelay_ct = sum(security_delay > 0, na.rm = TRUE),
    WeatherDelay_ct = sum(weather_delay > 0, na.rm = TRUE),
    NASDelay_ct = sum(nas_delay > 0, na.rm = TRUE),
    LateAircraftDelay_ct = sum(lateaircraft_delay > 0, na.rm = TRUE)) %>%
  mutate(
    TotalFlightsCount = sprintf("%d (100%%)", TotalFlights),
    CancelledFlightsCount = sprintf("%d (100%%)", CancelledFlights),
    DivertedFlightsCount = sprintf("%d (100%%)", DivertedFlights),
    CarrierDelayCount = sprintf("%d (100%%)", CarrierDelay_ct),
    SecurityDelayCount = sprintf("%d (100%%)", SecurityDelay_ct),
    WeatherDelayCount = sprintf("%d (100%%)", WeatherDelay_ct),
    NASDelayCount = sprintf("%d (100%%)", NASDelay_ct),
    LateAircraftDelayCount = sprintf("%d (100%%)", LateAircraftDelay_ct)
  )

Table1.2_combined <- bind_rows(Table1.2, Table1.2_total)



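library(lubridate)

# Format a numeric time of the form HHMM.ff (hours, minutes, fraction of a minute)
# as "HH:MM:SS". The inputs here are means of hhmm-encoded times, so the minutes
# field can exceed 59 (for example 15:73:19); such values are not true clock times.

convert_to_time <- function(time_val) {
  rounded_time <- round(time_val, 2)
  hours <- floor(rounded_time / 100)                       # leading HH digits
  minutes_with_secs <- (rounded_time %% 100)               # MM.ff part
  minutes <- floor(minutes_with_secs)
  seconds <- round((minutes_with_secs - minutes) * 60, 0)  # fraction of a minute to seconds
  time_formatted <- sprintf("%02d:%02d:%02d", hours, minutes, seconds)
  return(time_formatted)
}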

#Apply time conversion, remove extra rows

Table1.3_combined <- Table1.2_combined %>%
  mutate(
    AvgCRSDepTime = sapply(AvgCRSDepTime, convert_to_time),
    AvgDepTime = sapply(AvgDepTime, convert_to_time),
    AvgCRSArrTime = sapply(AvgCRSArrTime, convert_to_time),
    AvgArrTime = sapply(AvgArrTime, convert_to_time),
  ) %>%
  mutate(across(-flight_period, as.character)
  ) %>%
  select(
    flight_period,
    TotalFlightsCount,
    CancelledFlightsCount,
    DivertedFlightsCount,
    AvgCRSDepTime,
    AvgDepTime,
    AvgDepDelay,
    AvgTaxiOut,
    AvgTaxiIn,
    AvgCRSArrTime,
    AvgArrTime,
    AvgArrDelay,
    AvgAirTime,
    CarrierDelayCount,
    SecurityDelayCount,
    WeatherDelayCount,
    NASDelayCount,
    LateAircraftDelayCount
  )

#Pivot table

Table1.3_pivoted <- Table1.3_combined %>% 
  pivot_longer(
    cols = -flight_period,
    names_to = "Statistic", 
    values_to = "Value") %>% 
  pivot_wider(
    names_from = flight_period,
    values_from = Value
)

#gt Table1

Table1.3_pivoted %>%
  gt() %>%
  tab_header(
    title = "Flight Delay Summary by Flight Period"
  ) %>%
  cols_label(
    Statistic = "Statistic",
    Morning = "Morning",
    Afternoon = "Afternoon",
    Evening = "Evening",
    Total = "Total"
  ) %>%
  tab_spanner(
    label = "Flight Period",
    columns = c(Morning, Afternoon, Evening, Total)
  ) %>%
  tab_style(
    style = list(
      cell_text(color = "white"), 
      cell_fill(color = "rgba(0, 43, 54, 1)")
    ),
    locations = cells_body(
      columns = everything()
    )
  ) %>%
  tab_style(
    style = list(
      cell_text(color = "white"),
      cell_fill(color = "rgba(0, 43, 54, 1)")
    ),
    locations = cells_column_labels(
      columns = everything()
    )
  ) %>%
  tab_style(
    style = list(
      cell_text(color = "white", weight = "bold"),
      cell_fill(color = "rgba(0, 43, 54, 1)")
    ),
    locations = cells_title(
      groups = c("title", "subtitle")
    )
  ) %>%
  tab_style(
    style = list(
      cell_text(color = "white", weight = "bold"),
      cell_fill(color = "rgba(0, 43, 54, 1)")
    ),
    locations = cells_column_spanners(
      spanners = everything()
    )
  ) %>%
  tab_source_note(
    source_note = "Table 1: Summary includes morning, afternoon, and evening flight periods."
  )
Flight Delay Summary by Flight Period

| Statistic | Morning | Afternoon | Evening | Total |
| --- | --- | --- | --- | --- |
| TotalFlightsCount | 1246031 (41.5%) | 1423140 (47.4%) | 330829 (11.0%) | 3000000 (100%) |
| CancelledFlightsCount | 30690 (38.8%) | 38343 (48.4%) | 10107 (12.8%) | 79140 (100%) |
| DivertedFlightsCount | 2555 (36.2%) | 3901 (55.3%) | 600 (8.5%) | 7056 (100%) |
| AvgCRSDepTime | 08:49:31 | 15:73:19 | 20:66:23 | 13:27:04 |
| AvgDepTime | 08:53:58 | 15:89:05 | 20:12:40 | 13:29:47 |
| AvgDepDelay | 5.23 | 12.93 | 16.51 | 10.12 |
| AvgTaxiOut | 16.87 | 16.44 | 16.65 | 16.64 |
| AvgTaxiIn | 7.75 | 7.78 | 6.95 | 7.68 |
| AvgCRSArrTime | 10:87:15 | 17:85:11 | 17:42:14 | 14:90:34 |
| AvgArrTime | 10:86:01 | 17:71:56 | 15:89:47 | 14:66:31 |
| AvgArrDelay | -0.77 | 7.34 | 10.04 | 4.26 |
| AvgAirTime | 114.12 | 109.8 | 116.31 | 112.31 |
| CarrierDelayCount | 86824 (29.2%) | 162266 (54.6%) | 47861 (16.1%) | 296951 (100%) |
| SecurityDelayCount | 887 (32.1%) | 1434 (52.0%) | 438 (15.9%) | 2759 (100%) |
| WeatherDelayCount | 8380 (26.7%) | 18758 (59.7%) | 4290 (13.7%) | 31428 (100%) |
| NASDelayCount | 80604 (31.4%) | 144366 (56.3%) | 31507 (12.3%) | 256477 (100%) |
| LateAircraftDelayCount | 42721 (16.5%) | 168902 (65.2%) | 47391 (18.3%) | 259014 (100%) |

Table 1: Summary includes morning, afternoon, and evening flight periods.

The three flight periods are 8-hour segments (i.e., Morning covers departure times from 4am to noon, followed by Afternoon and Evening). The Afternoon period contains the most flights (47.4%), followed closely by the Morning period (41.5%), with the Evening period well behind (11.0%). The table also gives the mean departure and arrival times, which indicate where flights are concentrated within each period. Because these means are computed on hhmm-encoded values, some minute fields exceed 59 and should be read only as rough indicators. The average departure and arrival delays are lowest for the Morning period (5.23 and -0.77 minutes) and increase through the Afternoon and Evening periods. The delay counts by type show that the Afternoon and Morning periods account for most of the total delays, though this does not adjust for the much smaller number of Evening flights.
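Averaging hhmm-encoded times as plain numbers can produce minute fields above 59, as in 15:73:19 in Table 1. A possible alternative, sketched below, converts each time to minutes since midnight before averaging. The helper names hhmm_to_minutes and minutes_to_clock are hypothetical, the example times are made up, and the sketch does not handle periods that wrap past midnight.

```r
# Convert an hhmm-encoded clock time (e.g. 1545 for 3:45pm) to minutes since midnight.
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + (hhmm %% 100)
}

# Format minutes since midnight as "HH:MM:SS".
minutes_to_clock <- function(mins) {
  total_secs <- round(mins * 60)
  sprintf("%02d:%02d:%02d",
          as.integer(total_secs %/% 3600),
          as.integer((total_secs %% 3600) %/% 60),
          as.integer(total_secs %% 60))
}

dep_times <- c(1545, 1555, 1620)   # made-up hhmm values

# Naive mean of the raw codes: about 1573.33, which reads as the invalid time "15:73".
naive_mean <- mean(dep_times)

# Mean on the minutes scale, formatted back to a clock time.
clock_mean <- minutes_to_clock(mean(hhmm_to_minutes(dep_times)))
print(clock_mean)
```

With this approach the three example departures average to a valid clock time, whereas the naive mean of the raw codes does not correspond to any time of day.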

Some Visualizations of the Dataset

These histograms illustrate the frequencies of air time, arrival delays, and departure delays. The y-axis was transformed to make the visualizations more legible. All three distributions are skewed to the right. For air time, this is consistent with a higher proportion of regional flights and the exclusion of international departures and arrivals. It is also to be expected that shorter delays, for both arrivals and departures, are more frequent than longer delays.

This visualization shows the average arrival delay for the largest five airlines (filtered for carriers with over 200,000 flights in the given period). The standard deviations for these airlines are fairly small, indicating a low variability in the arrival delays for these airlines.

This heat map shows the average arrival delay for flights by origin airport. The motivation is that a flight delayed at departure may also be delayed on arrival at its destination.

Correlation Matrix for Continuous Variables:
                  DEP_DELAY   TAXI_OUT    TAXI_IN    ARR_DELAY CRS_ELAPSED_TIME
DEP_DELAY        1.00000000 0.04483006 0.01783299  0.965628452      0.022205036
TAXI_OUT         0.04483006 1.00000000 0.02466135  0.186389179      0.079238740
TAXI_IN          0.01783299 0.02466135 1.00000000  0.110128380      0.102555059
ARR_DELAY        0.96562845 0.18638918 0.11012838  1.000000000     -0.003073467
CRS_ELAPSED_TIME 0.02220504 0.07923874 0.10255506 -0.003073467      1.000000000
ELAPSED_TIME     0.02609654 0.18252089 0.16957221  0.049369903      0.982448199
AIR_TIME         0.01928498 0.05349243 0.08056859  0.016201773      0.989281781
DISTANCE         0.02002126 0.04030996 0.07296821  0.001217362      0.982538270
                 ELAPSED_TIME   AIR_TIME    DISTANCE
DEP_DELAY          0.02609654 0.01928498 0.020021260
TAXI_OUT           0.18252089 0.05349243 0.040309965
TAXI_IN            0.16957221 0.08056859 0.072968208
ARR_DELAY          0.04936990 0.01620177 0.001217362
CRS_ELAPSED_TIME   0.98244820 0.98928178 0.982538270
ELAPSED_TIME       1.00000000 0.98764824 0.969600832
AIR_TIME           0.98764824 1.00000000 0.983888247
DISTANCE           0.96960083 0.98388825 1.000000000
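One way a matrix like this can be computed is with cor() on the numeric columns. The sketch below uses a small simulated data frame; the column names follow the output above, but the values and the relationships built into them are made up, and the choice of pairwise-complete observations for handling missing values is an assumption.

```r
set.seed(7)
n <- 1000

# Simulated stand-in data (hypothetical values).
dep_delay <- rexp(n, rate = 1 / 20) - 5
flights <- data.frame(
  DEP_DELAY = dep_delay,
  ARR_DELAY = dep_delay + rnorm(n, sd = 5),
  DISTANCE = runif(n, min = 100, max = 2500)
)
flights$AIR_TIME <- flights$DISTANCE / 8 + rnorm(n, sd = 10)

# Mimic missing arrival delays, as for cancelled or diverted flights.
flights$ARR_DELAY[sample(n, 20)] <- NA

# Pearson correlations, using all complete pairs for each pair of columns.
cor_matrix <- cor(flights, use = "pairwise.complete.obs")
print(round(cor_matrix, 3))
```

Without the use argument, a single missing value in a column would turn every correlation involving that column into NA.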

Testing between each of the following variables (AIRLINE_CODE, ORIGIN, DEST, DELAY_DUE_CARRIER, DELAY_DUE_WEATHER, DELAY_DUE_NAS, DELAY_DUE_SECURITY, DELAY_DUE_LATE_AIRCRAFT, CRS_DEP_HOUR, DEP_HOUR, WHEELS_OFF_HOUR, WHEELS_ON_HOUR, CRS_ARR_HOUR, ARR_HOUR, FLIGHT_PERIOD) and each continuous variable (DEP_DELAY, TAXI_OUT, TAXI_IN, ARR_DELAY, CRS_ELAPSED_TIME, ELAPSED_TIME, AIR_TIME, DISTANCE).

Modeling and Results

  • Explain your data preprocessing and cleaning steps.

  • Present your key findings in a clear and concise manner.

  • Use visuals to support your claims.

  • Tell a story about what the data reveals.

Conclusion

  • Summarize your key findings.

  • Discuss the implications of your results.

References

Bayes, T. 1763. “An Essay Towards Solving a Problem in the Doctrine of Chances.”
Brilleman, SL, MJ Crowther, M Moreno-Betancur, J Buros Novik, and R Wolfe. 2018. “Joint Longitudinal and Time-to-Event Models via Stan.” https://github.com/stan-dev/stancon_talks/.
Dogucu, Mine, Alicia Johnson, and Miles Ott. 2021. Bayesrules: Datasets and Supplemental Functions from Bayes Rules! Book. https://github.com/bayes-rules/bayesrules.
Johnson, Alicia A, Miles Q Ott, and Mine Dogucu. 2022. Bayes Rules!: An Introduction to Bayesian Modeling with R. Chapman & Hall.
Lesaffre, Emmanuel, and Andrew B Lawson. 2012. Bayesian Biostatistics. 1st ed. Somerset: John Wiley & Sons, Ltd. https://doi.org/10.1002/9781119942412.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Yan, Xin, and Xiao Gang Su. 2009. Linear Regression Analysis: Theory and Computing. Singapore: World Scientific Publishing. https://ebookcentral.proquest.com/lib/uwf/reader.action?docID=477274&ppg=318&pq-origsite=primo.
Zelazko, Patrick. 2023. “Flight Delay and Cancellation Dataset (2019-2023).” https://www.kaggle.com/datasets/patrickzel/flight-delay-and-cancellation-dataset-2019-2023.